1 Motiavtion & Goal

Although movies are mainly made for entertainment, not earning enough profit puts filmmakers in an awkward situation that makes it difficult to keep producing high-quality movies. One most popular quality metric is the score from Internet Movie Database (IMDb). Based on the metadata from IMDb, it would be interesting to analyze what makes a movie more successful than another, commercially and critically.

So, the main goal of this project is to explore the IMDb dataset with focus on profit and IMDb score and present the findings in an intuitive and interactive way. And, if possible, to identify what features contribute to a highly rated/profitable film the most and try to predict a movie’s profitability.

These are the R packages required for this project.

library(tidyverse)
library(knitr)
library(plotly)
library(ggrepel)
library(DT)
library(tm)
library(openxlsx)

2 Data Sources & Preparation

The dataset used in this project came from the IMDb 5000 Movie Dataset from Kaggle. It recorded information on more than 5000 movies across 66 countries from 1916 to 2016. The dataset is available in a csv format file and is of size 1MB. Note that the original dataset is replaced on Kaggle website, and therefore I cannot access the original one. The following link is where I sourced the same data from: https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset (Yueming Zhang 2017).

The data preparation part consists of the following tasks:

  1. Import the data.
  2. Clean and tidy up the data to ensure its quality and make it easier to manipulate.
  3. Detail data description.

2.1 Load Data

First, I import the data and check the dimension and names of all attributes of the data. There are 5043 movies recorded, and each record has 28 attributes, including information such as “Movie’s Title”, “Director’s Name”, “Budget of the movie”, and “IMDb score of the movie”.

url <- "https://raw.githubusercontent.com/TimotheusHuang/msds597finalproj/main/movie_metadata.csv"
IMDB <- as_tibble(read.csv(url, stringsAsFactors = FALSE))
dim(IMDB)
## [1] 5043   28
colnames(IMDB)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"

Here is the preview of the raw data.

DT::datatable(IMDB %>% top_n(n = 100),
              options = list(
                # columnDefs = list(list(className = 'dt-center', target = 0)),
                autoWidth = TRUE,
                pageLength = 5,
                lengthMenu = c(5, 10, 15, 20),
                initComplete = JS(
                  "function(settings, json) {",
                  "$(this.api().table().header()).css({'background-color': '#4AA4DE', 'color': '#000'});",
                  "}"
                )
              ))

2.2 Data Tidying & Cleansing

Then, to get the data ready for analysis, some tidying and cleansing works need to be done. First, unnecessary characters in the movie_title, genre and plot_keyword columns were removed.

IMDB <- IMDB %>%
  mutate(movie_title = str_replace_all(movie_title, "\\Â", "")) %>%
  mutate(genres_new = str_replace_all(genres, "\\|", " ")) %>%
  mutate(plot_keywords_new = str_replace_all(plot_keywords, "\\|", " "))

Second, duplicates in movie_title column were removed as they may affect later analysis. In total, 126 duplicate movies were removed.

sum(duplicated(IMDB$movie_title))
## [1] 126
IMDB <- IMDB[!duplicated(IMDB$movie_title),]

Third, columns that contain currency, such as the budget and gross, may cause problems in later analysis because a few countries were not converted to US dollars, such as “South Korea” and “Japan”. Furthermore, given that all of them are converted to US dollars, we still need to consider the inflation, which makes the problem even more complicated. Thus, only movies from USA were kept for the profitability analysis.

IMDB %>% select(movie_title, budget, country, gross) %>% arrange(desc(budget))
## # A tibble: 4,917 x 4
##    movie_title                 budget country        gross
##    <chr>                        <dbl> <chr>          <int>
##  1 Lady Vengeance          4200000000 South Korea   211667
##  2 Fateless                2500000000 Hungary       195888
##  3 Princess Mononoke       2400000000 Japan        2298191
##  4 Steamboy                2127519898 Japan         410388
##  5 Akira                   1100000000 Japan         439162
##  6 Godzilla 2000           1000000000 Japan       10037390
##  7 Kabhi Alvida Naa Kehna   700000000 India        3275443
##  8 Tango                    700000000 Spain        1687311
##  9 Kites                    600000000 India        1602466
## 10 Red Cliff                553632000 China         626809
## # ... with 4,907 more rows

Then, an new column profitable was added to indicate if a movie is profitable, ‘1’ means profitable (that is, profit \(>\) budget). As this involving both gross and budget columns, only movies from USA have non-empty value for this column. This version of data is saved to an Excel file for people who want to focus only on USA films.

IMDB.USA <- IMDB %>%
  filter(country == "USA") %>%
  mutate(profitable = as.factor(ifelse(gross > budget, 1, 0)))

write.xlsx(IMDB.USA, file = "IMDB-USA.xlsx")

Last, regarding records that contain missing values, to keep the entire dataset as complete as possible, I decided to not remove any rows with missing data and to handle this issue for each individual analysis. For example, when doing genre-wise analysis, those without values for genre variables are excluded from the analysis.

After the cleaning, there are 4917 records remain and each have 30 attributes, and then these are saved to a csv file for future use.

dim(IMDB)
## [1] 4917   30
colnames(IMDB)
##  [1] "color"                     "director_name"            
##  [3] "num_critic_for_reviews"    "duration"                 
##  [5] "director_facebook_likes"   "actor_3_facebook_likes"   
##  [7] "actor_2_name"              "actor_1_facebook_likes"   
##  [9] "gross"                     "genres"                   
## [11] "actor_1_name"              "movie_title"              
## [13] "num_voted_users"           "cast_total_facebook_likes"
## [15] "actor_3_name"              "facenumber_in_poster"     
## [17] "plot_keywords"             "movie_imdb_link"          
## [19] "num_user_for_reviews"      "language"                 
## [21] "country"                   "content_rating"           
## [23] "budget"                    "title_year"               
## [25] "actor_2_facebook_likes"    "imdb_score"               
## [27] "aspect_ratio"              "movie_facebook_likes"     
## [29] "genres_new"                "plot_keywords_new"
write.xlsx(IMDB, file = "IMDB-cleaned.xlsx")

And there are 3700 rows that do not have any missing value and they are also saved to a csv file.

sum(complete.cases(IMDB %>% drop_na()))
## [1] 3700
write.xlsx(IMDB %>% drop_na(), file = "IMDB-complete-only.xlsx")

2.3 Data Description

The following table gives the name, type, and description of each variable in the dataset.

Var.tab <- tibble(
  Name = colnames(IMDB),
  Type = sapply(IMDB, typeof),
  Description = c(
    "Colorization: `Color` or `Black and White`",
    "Name of the director",
    "Number of Critical Reviews",
    "Duration of the movie in Minutes",
    "Number of FB Page Likes of Director",
    "Number of FB Page Likes of Actor No.3",
    "Name of Actor No.2",
    "Number of FB Page Likes of Actor No.1",
    "Gross Earned in US Dollars",
    "Classification: `Action`, `Comedy`, `Drama`, ..., etc.",
    "Name of Actor No.1",
    "Title of the Movie",
    "Number of Voted Users on IMDB",
    "Total FB Page Likes of of the Entire Cast",
    "Name of Actor No.3",
    "Number of the Actors Featured in the Movie Poster",
    "Keywords Describing the Plot",
    "IMDB Link of the Movie",
    "Number of Users who Reviewed the Movie",
    "Language of the movie: `English`, `French`, `Chinese`, ..., etc.",
    "Country where the Movie was Produced",
    "Content rating",
    "Budget in US Dollars",
    "Year of Release",
    "Number of FB Page Likes of Actor No.2",
    "IMDB Score on a Scale of 1 to 10",
    "Aspect Ratio",
    "Number of FB Page Likes of the Film",
    "Edited `genres`",
    "Edited `plot_keywords`"
    # "Flag indicating profitability of the movie (1-profit, 0-loss)"
  )
)
knitr::kable(Var.tab, "simple")
Name Type Description
color character Colorization: Color or Black and White
director_name character Name of the director
num_critic_for_reviews integer Number of Critical Reviews
duration integer Duration of the movie in Minutes
director_facebook_likes integer Number of FB Page Likes of Director
actor_3_facebook_likes integer Number of FB Page Likes of Actor No.3
actor_2_name character Name of Actor No.2
actor_1_facebook_likes integer Number of FB Page Likes of Actor No.1
gross integer Gross Earned in US Dollars
genres character Classification: Action, Comedy, Drama, …, etc.
actor_1_name character Name of Actor No.1
movie_title character Title of the Movie
num_voted_users integer Number of Voted Users on IMDB
cast_total_facebook_likes integer Total FB Page Likes of of the Entire Cast
actor_3_name character Name of Actor No.3
facenumber_in_poster integer Number of the Actors Featured in the Movie Poster
plot_keywords character Keywords Describing the Plot
movie_imdb_link character IMDB Link of the Movie
num_user_for_reviews integer Number of Users who Reviewed the Movie
language character Language of the movie: English, French, Chinese, …, etc.
country character Country where the Movie was Produced
content_rating character Content rating
budget double Budget in US Dollars
title_year integer Year of Release
actor_2_facebook_likes integer Number of FB Page Likes of Actor No.2
imdb_score double IMDB Score on a Scale of 1 to 10
aspect_ratio double Aspect Ratio
movie_facebook_likes integer Number of FB Page Likes of the Film
genres_new character Edited genres
plot_keywords_new character Edited plot_keywords

3 Exploratory Data Analysis

The analysis is focused on four main aspects: Genre, Country, IMDb score, and how they relate to Profitability.

3.1 Genre-wise Analysis

Since most of movies in this dataset were categorized as multiple genres, some preprocessing on the genres must be done before actually analyzing the data.

3.1.1 Genres Preprocessing

First, a document-term matrix for genres was constructed using the package “TM”.

genres.dtm <- DocumentTermMatrix(Corpus(VectorSource(IMDB[!is.na(IMDB$genres_new), ]$genres_new)))
genres.dtm
## <<DocumentTermMatrix (documents: 4917, terms: 26)>>
## Non-/sparse entries: 14127/113715
## Sparsity           : 89%
## Maximal term length: 11
## Weighting          : term frequency (tf)

Then, the created document-term matrix was used to calculate frequency for each genre.

genres.freq <- colSums(as.matrix(genres.dtm))
genres.freq <- tibble(genre = names(genres.freq), count = genres.freq) %>% arrange(desc(count))
genres.freq
## # A tibble: 26 x 2
##    genre     count
##    <chr>     <dbl>
##  1 drama      2533
##  2 comedy     1847
##  3 thriller   1364
##  4 action     1113
##  5 romance    1084
##  6 adventure   888
##  7 crime       868
##  8 sci-fi      594
##  9 fantasy     583
## 10 horror      539
## # ... with 16 more rows

3.1.2 Genres Distribution

After preprocessing was done, I first try to find which genres are used the most. By plotting distribution of genres frequency, we can see that the top 5 movie genres are “Drama”, “Comedy”, “Thriller”, “Action”, and “Romance”.

genres.freq %>%
  plot_ly(
    x = ~ reorder(genre, -count),
    y = ~ count,
    type = "bar"
  ) %>%
  layout(title = "Movie Genre Frequency Distribution", xaxis = list(title = "genre"))

3.1.3 Genres vs. Profitability vs. IMDb Score

Now, we want to identify which genre tend to have higher profitability as well as ratings. As mentioned before, this part involves movies only from USA due to the currency conversion issue.

First, I calculate the average budget, gross, and profit (= gross - budget) for each genre.

IMDB.USA.genre <- IMDB.USA %>% 
  drop_na(genres_new, gross, budget, imdb_score) %>%
  # select(genres_new, budget, gross) %>%
  mutate(profit = gross - budget) %>%
  # mutate(n = row_number()) %>%
  separate_rows(genres_new, sep = ' ') %>%
  group_by(genres_new) %>%
  summarise(mean_gross = mean(gross), mean_budget = mean(budget), mean_profit = mean(profit), mean_imdb = mean(imdb_score))

IMDB.USA.genre
## # A tibble: 23 x 5
##    genres_new  mean_gross mean_budget mean_profit mean_imdb
##    <chr>            <dbl>       <dbl>       <dbl>     <dbl>
##  1 Action       85881165.   70503681.   15377484.      6.24
##  2 Adventure   107588235.   80902013.   26686222.      6.41
##  3 Animation   120847003.   85607853.   35239151.      6.64
##  4 Biography    46089398.   29135662.   16953735.      7.10
##  5 Comedy       54540529.   35059454.   19481075.      6.14
##  6 Crime        44945166.   33948142.   10997025.      6.50
##  7 Documentary  14683626.    4021501.   10662126.      6.84
##  8 Drama        42460107.   29585071.   12875036.      6.72
##  9 Family       94310623.   64866065.   29444559.      6.19
## 10 Fantasy      90539880.   66073493.   24466387.      6.23
## # ... with 13 more rows

Then, I compared them using a grouped bar chart sorted by a descending order of the mean profit. And here are what I’ve found:

  1. Top 3 most profitable genres are “Animation”, “Family”, and “Musical”.
  2. In general, films of genres of bigger budget (\(> \$80M\)) tend to earn more profit, with “Action” being the only exception.
  3. Low-budget films of genres such as “Musical” and “Comedy” are really good investments.
  4. All the top 5 popular genres (“Drama”, “Comedy”, “Thriller”, “Action”, and “Romance”) have average profit less than \(\$20M\).
  5. Most profitable genres are not highly rated.
IMDB.USA.genre %>% 
  plot_ly(
    x = ~ reorder(genres_new, -mean_profit),
    y = ~ mean_profit, 
    name = 'Mean Profit',
    type = 'bar'
  ) %>% 
  add_bars(y = ~ mean_gross, name = 'Mean Gross') %>%
  add_bars(y = ~ mean_budget, name = 'Mean Budget') %>% 
  add_lines(y = ~ mean_imdb, yaxis="y2", name = 'Mean Ratings') %>% 
  layout(title = 'Genre Profitability Breakdown (USA)', xaxis = list(title = 'Genre'), yaxis = list(title = 'US Dollars'), barmode = 'group', yaxis2 = list(overlaying = "y", side = "right"))

3.3 IMDb Score Analysis

Last, we tried to analyze what makes movies more successful in terms of the IMDb ratings.

First, to look at the basic central tendency (mean) and the variation in movie score, we plot a histogram of IMDb score along with several summary statistics. We can see that the mean IMDb Score is 6.4 and 80% of movies have a score between 7.7 and 5.1. For a movie to be a top \(10\%\) in the worlds, it has to have a score of at least 7.7.

IMDB.score <- IMDB %>% 
  drop_na() %>% 
  select(imdb_score)

summary(IMDB.score)
##    imdb_score   
##  Min.   :1.600  
##  1st Qu.:5.900  
##  Median :6.600  
##  Mean   :6.464  
##  3rd Qu.:7.200  
##  Max.   :9.300
quantile(IMDB.score$imdb_score, c(.1, .9)) 
## 10% 90% 
## 5.1 7.7
# summary(IMDB[!is.na(IMDB$imdb_score), ]$imdb_score)

Here is the IMDb score distribution.

IMDB %>%
  drop_na(imdb_score) %>%
  plot_ly(
    x = ~ imdb_score,
    type = "histogram",
    name = "IMDb Score"
  ) %>%
  add_segments(x = 5.1, xend = 5.1, y = 0, yend = 250, name = "10 Percentile") %>%
  add_segments(x = 5.9, xend = 5.9, y = 0, yend = 250, name = "1st Quantile") %>%
  add_segments(x = 6.6, xend = 6.6, y = 0, yend = 250, name = "Median") %>%
  add_segments(x = 7.2, xend = 7.2, y = 0, yend = 250, name = "3rd Quantile") %>%
  add_segments(x = 7.7, xend = 7.7, y = 0, yend = 250, name = "90 Percentile") %>%
  layout(
    title = "IMDb Score Distribution",
    xaxis = list(title = "IMDb Score"),
    yaxis = list(title = "Count")
  )

3.3.1 Interesting Facts

  • Top 10 movies with highest IMDb score
IMDB %>%
  drop_na(movie_title, imdb_score) %>%
  group_by(movie_title) %>%
  summarise(avg_imdb = mean(imdb_score)) %>%
  arrange(desc(avg_imdb)) %>%
  top_n(10, avg_imdb) %>%
  plot_ly(
    x =  ~ avg_imdb,
    y =  ~ reorder(movie_title, avg_imdb),
    type = 'bar',
    orientation = 'h',
    name = "Top 10 Movies with Highest IMDb Score",
    marker = list(color = '#00dfad')
  ) %>%
  layout(
    xaxis = list(title = "IMDb Score"),
    yaxis = list(title = "")
  )
  • Top 10 directors with highest average IMDb score
IMDB %>%
  drop_na(director_name, imdb_score) %>%
  group_by(director_name) %>%
  summarise(avg_imdb = mean(imdb_score)) %>%
  arrange(desc(avg_imdb)) %>%
  top_n(10, avg_imdb) %>%
  plot_ly(
    x =  ~ avg_imdb,
    y =  ~ reorder(director_name, avg_imdb),
    type = 'bar',
    orientation = 'h',
    name = "Top 10 Directors with Highest IMDb Score",
    marker = list(color = '#00dfad')
  ) %>%
  layout(
    xaxis = list(title = "IMDb Score"),
    yaxis = list(title = "")
  )

3.3.2 IMDb Score vs. Profitabilty

To understand the relationship between IMDB score, profit and budget, I plotted a 3D scatter plot using the package “plotly” to try to have a big picture about it.

IMDB.USA.profit <- IMDB.USA %>%
  drop_na(gross, budget) %>%
  mutate(gross = gross/10^6, budget = budget/10^6) %>%
  mutate(profit = gross - budget) %>% 
  mutate(ROI = 100*profit/budget)

From the plot, we can see that movies with higher IMDb score tend to have higher profit and significant number of movies ended up losing money. This aligns with my intuition, IMDb score and gross might be correlated as people are more willing to watch famous and highly-rated movies. We also can observe that bigger budget does not guarantee the quality.

3.3.2.1 Commercial Success v.s. Critical Acclaim

However, people may be more interested in those films with huge commerical success, the top 30 profitable films were plotted with their gross and IMDb scores. Note that the vertical and the horizontal line refers to the median gross and the median IMDb score, and bigger point means higher profit earned. Taking a closer look at relationship of these films with their IMDb ratings, we see little correlation between them. This is as expected since most highly-rated films don’t do well on box office.

P <- IMDB.USA.profit %>%
  top_n(30, profit) %>%
  ggplot(aes(
    x = imdb_score,
    y = gross,
    size = profit
  )) +
  geom_point() +
  # geom_point(aes(
  #   color = content_rating)) +
  geom_hline(aes(yintercept = median(gross))) +
  geom_vline(aes(xintercept = median(imdb_score))) +
  xlab("IMDB Score") +
  ylab("Gross Earning in Million Dollars") +
  ggtitle("Top 30 Profitable Films v.s. IMDB Scores") +
  geom_text(
    aes(label = movie_title),
    nudge_x = 0.25,
    nudge_y = 0.25,
    check_overlap = T,
    size = 3
  )

ggplotly(P)

Then, these 30 movies were plotted again, but, with their profits and IMDb scores. Similarly, bigger points mean higher IMDb scores. For movies with budget over 70 millions dollars, we can observe an upward trend close to linear, which can be inferred that bigger-budget movies tend to earn more profit. However, there’s a downward trend when the budget is less than 70 millions dollars. Having a closer look, these movies are found to be produced in the 80s or early 90s, and so their true budget should be higher with inflation being taken into consideration.

IMDB.USA.profit %>%
  select(movie_title, profit, ROI, budget, title_year, imdb_score) %>%
  arrange(desc(profit)) %>%
  top_n(30, profit) %>%
  ggplot(aes(x = budget,
             y = profit)) +
  geom_point(aes(size = imdb_score,
                 color = title_year)) +
  geom_smooth() +
  geom_text_repel(aes(label = movie_title)) +
  xlab("Budget in Million Dollars") +
  ylab("Profit in Million Dollars") +
  ggtitle("Top 30 Most Profitable Movies")

# ggplotly(P)

3.3.2.2 Return on Investment as Profitability

However, the profit earned does not give a whole picture about financial success of a movie throughout the years, so “Return on Investment (ROI)” is used to provide a different perspective about a movie’s profitability. The following graph shows top 30 highest Return on Investment movies of at least 10 millions dollars budget. As expected, films with smaller budget have higher ROI and the ROI decreases as the budget grows bigger. Yet, we can see that the ROIs for movies with over \(\$20M\) budget do not differ much.

IMDB.USA.profit %>%
  filter(budget > 10) %>% 
  top_n(30, ROI) %>%
  ggplot(aes(
    x = budget,
    y = ROI
  )) +
  geom_point(aes(
    size = imdb_score,
    color = title_year)) +
  geom_smooth() +
  geom_text_repel(
    aes(label = movie_title)
  ) +
  xlab("Budget in Million Dollars") +
  ylab("ROI in Pecentage") +
  ggtitle("Top 30 Movies with Highest ROI")

References

Yueming Zhang. 2017. IMDB 5000 Movie Dataset | Kaggle. https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset.